Use INE mobility data to explore the value of CORINE land cover data when modeling human mobility in metropolitan areas.
Can random forest regression with land cover variables improve predictions compared to a simple gravity model?
Given the large number of land cover variables (52, counting both origin and destination variables) and their uncertain relationship to human mobility, random forests may help incorporate these data into a model without making assumptions about interactions (e.g., between origin and destination variables) or transformations (e.g., a log scale). Random forest models also indicate the relative importance of each variable, which may be of interest in its own right.
Which model provides the best predictions for new metropolitan areas?
Given that INE provides mobility data covering nearly all residents of Spain, a model of mobility is more useful if it can provide accurate predictions for areas it was not trained on. To explore this, I use mobility data for the 40 largest metropolitan areas in Spain and run each model (RF, gravity) 40 times. Each time, one city serves as the test set while the other 39 cities make up the training set. I then compare the predictions to observed values using the root mean square error (RMSE) and the Common Part of Commuters (CPC), a metric that is common in the literature.
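For reference, the two metrics can be written out as follows. The notation is assumed here for illustration and is not taken from the INE documentation: $T_{ij}$ is the observed flow from origin $i$ to destination $j$, $\hat{T}_{ij}$ is the predicted flow, and $N$ is the number of origin-destination pairs in the test set.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i,j}\left(T_{ij}-\hat{T}_{ij}\right)^{2}},
\qquad
\mathrm{CPC} = \frac{2\sum_{i,j}\min\left(T_{ij},\hat{T}_{ij}\right)}
                    {\sum_{i,j}T_{ij}+\sum_{i,j}\hat{T}_{ij}}
```

For non-negative flows, CPC ranges from 0 (no overlap between observed and predicted flows) to 1 (perfect agreement), so a higher CPC and a lower RMSE both indicate better predictions.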
When mapped, are there substantive differences between the model predictions?
I map flows for selected metropolitan areas to visually examine the differences between the model predictions.
INE provided data on flows between mobility areas for all Wednesdays and Sundays in 2021. Here, I limit the data to 13 October 2021, the Wednesday on which Spain had the fewest daily Covid cases and few restrictions in place. Hopefully mobility on this date best approximates “the new normal.” I include only flows within each metropolitan area, not flows between them.
Gravity Model
Linear model of flows between mobility areas with the following independent variables: population of origin, population of destination, and distance between origin and destination. All variables, including the flow itself, are on the log scale.
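Written out, the specification described above corresponds to the equation below. The symbols are assumed for illustration: $T_{ij}$ is the flow from origin $i$ to destination $j$, $P_i$ and $P_j$ are the origin and destination populations, $d_{ij}$ is the distance between them, and $\varepsilon_{ij}$ is the error term.

```latex
\log T_{ij} = \beta_0 + \beta_1 \log P_j + \beta_2 \log P_i + \beta_3 \log d_{ij} + \varepsilon_{ij}
\quad\Longleftrightarrow\quad
T_{ij} = e^{\beta_0}\, P_j^{\beta_1}\, P_i^{\beta_2}\, d_{ij}^{\beta_3}\, e^{\varepsilon_{ij}}
```

Because the model is fit on the log scale, predicted flows are obtained by exponentiating the fitted values.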
Random Forest (RF)
Random forest regression with 500 trees, including the variables in the gravity model and the 52 land cover variables (total area of each land cover type). Fit using the ranger package.
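library(dplyr)   # mutate(), %>%
library(tibble)  # tibble()
library(ranger)  # ranger(), predictions()

# Random forest on the raw flows; every other column in `data` is used as a predictor.
rf <- ranger(dependent.variable.name = 'flujo', data = data, num.trees = 500, num.threads = 8)
# Gravity model: log-log linear regression.
gravity <- lm(log(flujo) ~ log(pob_destino) + log(pob_residencia) + log(dist), data = data)

# In-sample predictions, errors, and the per-flow minimum of observed and predicted (used for CPC).
data <- data %>% mutate(flujo_pred_rf = predictions(predict(rf, .)),
                        flujo_pred_grav = exp(predict(gravity, .)),  # back-transform from the log scale
                        errors_rf = flujo - flujo_pred_rf,
                        errors_grav = flujo - flujo_pred_grav,
                        min_rf = pmin(flujo, flujo_pred_rf),
                        min_grav = pmin(flujo, flujo_pred_grav))

# RMSE and CPC for each model.
tibble(Model = c("RF", "Gravity"),
       RMSE = c(sqrt(mean(data$errors_rf^2)),
                sqrt(mean(data$errors_grav^2))),
       CPC = c(sum(2 * data$min_rf) / (sum(data$flujo) + sum(data$flujo_pred_rf)),
               sum(2 * data$min_grav) / (sum(data$flujo) + sum(data$flujo_pred_grav))))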
## # A tibble: 2 × 3
##   Model    RMSE   CPC
##   <chr>   <dbl> <dbl>
## 1 RF       44.0 0.893
## 2 Gravity 151.  0.581
When the models are fit to and evaluated on the full 40-city dataset, the random forest model is superior in both RMSE and CPC, although these are in-sample predictions. For the random forest regression, we can see the degree to which each variable contributes to the model:
It appears that the origin variables ("_residencia") are frequently more important than the destination ones ("_destino"). The following table summarizes the mean importance of the two types of variables:
## # A tibble: 2 × 2
##   `Land Cover Variables` `Mean Importance`
##   <chr>                              <dbl>
## 1 Destination                    14744895.
## 2 Origin                         11298130.
This indicates that human mobility in the 40 metropolitan areas is more dependent on “push” factors than “pull” ones.
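A summary of this kind can be produced by grouping importance scores by variable-name suffix. The sketch below is a minimal base R illustration under stated assumptions: the land cover variable names (`urban_`, `forest_`) and all scores are invented, and the named vector stands in for what `importance(rf)` would return from a ranger model fit with an `importance` mode set. Which importance mode was used for the table above is not stated, so none is assumed here.

```r
# Hypothetical importance scores keyed by variable name (all values invented).
imp <- c(
  urban_residencia  = 12.0,
  forest_residencia = 3.0,
  urban_destino     = 9.0,
  forest_destino    = 2.0,
  pob_residencia    = 30.0,  # population: shares the suffix but is not land cover
  pob_destino       = 35.0,
  dist              = 40.0
)

# Keep only land cover variables: they carry an origin/destination suffix
# and are not the population variables.
is_land_cover <- grepl("_(residencia|destino)$", names(imp)) &
  !grepl("^pob_", names(imp))
lc <- imp[is_land_cover]

# Label each variable by its suffix and average importance within each group.
side <- ifelse(grepl("_residencia$", names(lc)), "Origin", "Destination")
importance_summary <- aggregate(
  importance ~ side,
  data = data.frame(side = side, importance = unname(lc)),
  FUN = mean
)
print(importance_summary)
```

A group mean can be pulled up by a handful of large scores, so it is worth reading the summary alongside the per-variable importance values.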
The following summarizes the 40 model runs in which each city, sequentially, serves as the test data while the other 39 serve as the training data.
## Cities with most accurate RF predictions:
## # A tibble: 6 × 7
##   city               rmse_rf rmse_lm better_rmse cpc_rf cpc_lm better_cpc
##   <chr>                <dbl>   <dbl> <chr>        <dbl>  <dbl> <chr>
## 1 Alicante/Alacant      97.4   142.  RF           0.758  0.585 RF
## 2 Lleida                74.2    99.0 RF           0.744  0.614 RF
## 3 Girona                82.5   127.  RF           0.737  0.581 RF
## 4 Gijón                 93.2   120.  RF           0.736  0.637 RF
## 5 Castellón/Castelló   134.    204.  RF           0.735  0.471 RF
## 6 León                 105.    140.  RF           0.731  0.563 RF
## Cities with least accurate RF predictions:
## # A tibble: 6 × 7
##   city                   rmse_rf rmse_lm better_rmse cpc_rf cpc_lm better_cpc
##   <chr>                    <dbl>   <dbl> <chr>        <dbl>  <dbl> <chr>
## 1 Madrid                    148.    88.7 Gravity      0.511  0.649 Gravity
## 2 Vitoria/Gasteiz           589.   747.  RF           0.534  0.237 RF
## 3 Badajoz                   230.   273.  RF           0.595  0.354 RF
## 4 Jaén                      194.   237.  RF           0.614  0.416 RF
## 5 Donostia-San Sebastián    262.   305.  RF           0.615  0.448 RF
## 6 Ourense                   205.   258.  RF           0.623  0.397 RF
## Plot of CPC for each test city:
For every city, save Madrid, the random forest model outperforms the gravity model.
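The leave-one-city-out procedure behind these results can be sketched as follows. This is a minimal base R illustration, not the code used for the tables above: the data are simulated, the `city` column in the flow table and the helpers `rmse()` and `cpc()` are assumptions introduced here, and only the gravity model is fit so that the example runs without extra packages. A ranger fit could be added inside the same loop.

```r
# Simulated stand-in for the flow table; all values are synthetic.
set.seed(1)
n <- 600
flows <- data.frame(
  city           = sample(paste0("city_", 1:5), n, replace = TRUE),
  pob_residencia = round(runif(n, 1e3, 5e4)),
  pob_destino    = round(runif(n, 1e3, 5e4)),
  dist           = runif(n, 0.5, 30)
)
flows$flujo <- with(flows, exp(
  -8 + 0.8 * log(pob_residencia) + 0.8 * log(pob_destino) -
    1.2 * log(dist) + rnorm(n, sd = 0.3)
))

# Hypothetical helper functions for the two metrics.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
cpc  <- function(obs, pred) 2 * sum(pmin(obs, pred)) / (sum(obs) + sum(pred))

# Hold out each city in turn, train on the rest, and score the held-out city.
loco <- do.call(rbind, lapply(unique(flows$city), function(test_city) {
  train <- flows[flows$city != test_city, ]
  test  <- flows[flows$city == test_city, ]
  fit   <- lm(log(flujo) ~ log(pob_destino) + log(pob_residencia) + log(dist),
              data = train)
  pred  <- exp(predict(fit, newdata = test))  # back-transform from the log scale
  data.frame(city    = test_city,
             rmse_lm = rmse(test$flujo, pred),
             cpc_lm  = cpc(test$flujo, pred))
}))
print(loco)
```

Each row of `loco` reports out-of-sample metrics for one held-out city, so no city is ever scored by a model that saw its own flows during training.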